Rediscovering the power of pairwise interactions 
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Two recent streams of work suggest that pairwise interactions may be sufficient to capture the 
complexity of biological systems ranging from protein structure to networks of neurons. In one 
approach, possible amino acid sequences in a family of proteins are generated by Monte Carlo 
annealing of a 'Hamiltonian' that forces pairwise correlations among amino acid substitutions to be 
close to the observed correlations. In the other approach, the observed correlations among pairs 
of neurons are used to construct a maximum entropy model for the states of the network as a 
whole. We show that, in certain limits, these two approaches are mathematically equivalent, and 
we comment on open problems suggested by this framework. 



I. INTRODUCTION 

In systems composed of many elements, rich and 
complex behavior can emerge from simple interactions. 
Indeed, for many systems studied by physicists and 
chemists, we can understand almost everything by think- 
ing just about interactions between pairs of elements. 
Sometimes this is an essentially exact statement: (al- 
most) all the complexity of chemical bonding and reac- 
tivity has its origins in the Coulomb interactions among 
electrons and protons, and the total energy associated 
with this interaction is a sum over pairs of particles. Al- 
ternatively, the pairwise description might not be micro- 
scopically exact, but could still be a very good approxi- 
mation, as in the description of many different kinds of 
magnets (ferromagnets, antiferromagnets, spin glasses) 
using only interactions between pairs of spins. In fact, 
pairwise interactions can generate essentially unlimited 
complexity, since finding the ground state of a model 
magnet with arbitrary pairwise interactions among the 
spins is an NP-complete problem 

Could models based on pairwise interactions be pow- 
erful enough to capture the behavior of biological sys- 
tems? One the one hand we know that pairwise inter- 
actions can provide considerable explanatory power, but 
on the other hand restricting our description to pairwise 
interactions is an enormous simplification. The tension 
between the physicists' desire for simplification and the 
biologists' appreciation of complexity is the subject of 
well known jokes 0]. Jokes aside, biological systems of- 
ten have many elements with no obvious geometrical ar- 
guments for simplification, and certainly there are many 
cases where the elements (bases along DNA, amino acids 
along a single protein chain, proteins, cells, ...) have 
many opportunities to interact in combinatorial fashion 
as they generate biological function. Two recent streams 
of work have led to a re-examination of these issues. 

Consider, for example, a protein with N amino acids. 
The structure and function of the protein is determined 
by its sequence, but these often are robust to small 



changes in sequence. More profoundly, a large family 
of proteins may share essential structural and functional 
features while having widely divergent sequences. We 
would like to have a description of this ensemble of se- 
quences, ideally being able to write down the probabil- 
ity distribution out of which functional sequences are 
drawn. Recent work argues that an effective description 
of this sequence ensemble for a protein family can be ob- 
tained by taking account of pairwise correlations among 
amino acids at different sites, but ignoring all higher or- 
der effects [1, |3] . Although this work does not provide 
an explicit construction of the underlying distribution, 
it does provide a Monte Carlo procedure for generat- 
ing new sequences that are consistent with the pairwise 
correlations in known families, and sequences generated 
in this way have been proven experimentally to be fully 
functional. 

What seems like a very different problem is provided 
by networks of neurons. If we look in a small window of 
time, then each neuron either does or does not generate 
an action potential (spike). For two cells chosen at ran- 
dom out of a densely connected collection of neurons, 
one typically finds that pairwise correlations are weak 
but statistically significant. Recently it has been sug- 
gested that the full pattern of correlations among all the 
neurons in such a network can be described by the maxi- 
mum entropy model 0, [|| that is consistent with the ob- 
served pairwise correlations, and this approach has been 
shown to provide successful predictions for the combina- 
torial patterns of activity in the vertebrate retina as it 
responds to natural movies @, [1] • These maximum en- 
tropy models in fact are Ising models with pairwise in- 
teractions, which have long been discussed as schematic 
models for neural networks [13] ; here the Ising model 
emerges directly as the least structured model consistent 
with the experimentally measured patterns of coincident 
spiking among pairs of cells. 

These two different examples represent two very differ- 
ent implementations of the idea that complex structures 
can emerge from pairwise interactions. It is important 
to note that in neither case is there any hope that the 
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pairwise description is microscopically exact. Thus the 
apparent success of the pairwise approximation provides 
a first hint that these systems, despite their complexity, 
are simpler than they might have been. This idea has 
been reinforced by yet more recent application of the 
pairwise, maximum entropy models to other neural sys- 
tems [ll|, EH EH j to a kinase cascade network 14 1 , and 
to the patterns of gene expression in yeast [l]|. 

Biochemical, genetic and neural networks all have dif- 
ferent structures, and the 'networks' of amino acids in a 
protein are yet more different. The mathematical ap- 
proaches taken in Refs [!, H, [23| and @, § also are 
very different, although the theme of pairwise correla- 
tions runs through both analyses. Here we show that 
the mathematical differences are only differences of em- 
phasis: In the relevant limit, the Monte Carlo meth- 
ods of Refs [H, 0, HH generate samples drawn out of the 
maximum entropy probability distribution that would 
be constructed using the methods of Refs @, [1] • Thus 
we have a unified framework for exploring the potential 
of pairwise interactions to tame the complexity of these 
and other biological systems. Within this framework we 
identify some open problems, and comment on the im- 
plicit analogies between neural networks and proteins. 



II. SETTING UP THE PROBLEM 

Let the system we are studying be described by vari- 
ables o\ that are associated with each element or site i, 
where i = 1, 2, • • • , N. For a network of neurons, i can 
label the individual cells, and o\ marks whether that cell 
generated an action potential in a small window of time; 
taken together, the set <ri, cr%, ■ ■ ■ , <jn = {c;} defines the 
pattern of spiking and silence across the whole network. 
For a protein, i is an index into the amino acid sequence, 
and CTj indicates which amino acid is found at site i along 
this sequence; the full sequence is defined by {cti} [la ]. 

We will phrase our discussion in terms of "operators" 
M ({CTi}) on the set of variables {ct;}. The simplest oper- 
ators are the variables <j\ themselves. For neurons, know- 
ing the expectation values (ct;) means that we know the 
probability of cell i generating an action potential in a 
small window of time — the "firing rate" of the cell. For a 
protein, knowing (ui) means that we know the probabil- 
ity of finding each of the twenty possible amino acids at 
position i in the sequence. The next most complicated 
operators involve pairs of variables; knowing the expec- 
tation values of these operators corresponds to knowing 
the probability of two cells generating synchronous ac- 
tion potentials, or the joint probabilities of finding two 
amino acids at particular locations along the protein se- 
quences. The central claim of the recent work reviewed 
above is that knowledge of the expectation values (O^) 
for these "one body" and "two body" operators is suffi- 
cient to describe, at least to a good approximation, the 
functional biological system. 

One approach to using knowledge of the expectation 



values (Ofj,) is to construct a probability distribution for 
the states of the system, P({ai}) that is consistent with 
this knowledge but otherwise is as random or unstruc- 
tured as possible; this is the maximum entropy distribu- 
tion Q . The form of this distribution is given by 



P(Wi}) = 



z({g,}) 



exp 



K 



(i) 



where the partition function Z serves to normalize the 
distribution; 



Z{{9»}) = £ exp 
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(2) 



By analogy with statistical mechanics it is useful to de- 
fine the free energy 



F({g^}) = -In Z({g^}). 



(3) 



Note that there is no real temperature in this problem, 
or equivalently we have chosen units in which ksT = 1. 
The coupling constants g M have to be chosen so that 
the expectation values in this distribution are equal to 
their known values, which is equivalent to solving the 
equations 
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This maximum entropy approach is the one used in re- 
cent work on the analysis of correlations in networks of 
neurons @,[|j]. 

As an alternative to the maximum entropy construc- 
tion, imagine that we create M copies of the system, 
with variables {o - ; 1 ^}, {ct; 2 ^}, • • • , {ct; M ^}. We can evalu- 
ate the empirical expectation values of the operators 
across these M copies, 



1 



M 



M ^ 

n=l 



(5) 



Given each of these empirical expectation values, we can 
try to form a measure of how close these M systems are 
to being representative of the true expectation values. 
Consider 
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where the are weights, expressing how seriously 
we take deviations in each of the individual operators. 
Notice that \ 2 is a function of all M x N variables 
{crp\ CTj 2 \ • • • , CTj M '}. We can try to force these vari- 
ables to be representative of the expectation values (O^) 
by drawing from the probability distribution 
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and then letting T — > 0; again, Zm is a partition function 
that serves to normalize the distribution. This annealing 
procedure is the one used in recent work on the synthesis 
of artificial proteins 0, 0, HH • 



III. MATHEMATICAL EQUIVALENCE OF THE 
TWO METHODS 

Our goal is to show that the probability distribution 
for M copies, Eq ([7]), really is equivalent to the maximum 
entropy distribution, Eq ([1]), in the limit T — > and 
M — ► oo. Interestingly, we will see that in this limit 
the precise values of the weights which enter the 
definition of \ 2 ar e irrelevant. 



To understand the predictions of the annealing 
method, we need to calculate the partition function Zm, 
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It will be useful near the end of our discussion to define 
another free energy G = — In Zm ■ Note that this free en- 
ergy depends on the expectation values {(O^)}, whereas 
the free energy F depends on the coupling constant {g^}- 

To make progress we use the standard approach of 
introducing auxiliary fields </> M to unpack the quadratic 
terms in the exponential: 
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Note that only the last term under the integral depends on the variables {a[ }. Thus when we compute the partition 
function we can take the sum over these variables under the integral and write 
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The crucial piece of this equation is the sum over all possible states of the M copies of the system, but since the M 
copies are independent given {^j-, we can simplify: 
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Now we notice that the sum over states in this expression is just the partition function of the maximum entropy 
distribution, Eq @, if we identify —iip^/M = g^. In this way we can relate the partition function for the annealing 
problem to an integral over the partition function of the maximum entropy problem, 
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The form of Eq (jT5J) suggests that we change variables from ^ to the coupling constants g^, and, recalling that 
z i{9p}) = exp[-F({g M })], we obtain 
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where the effective free energy 
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and T = MT. The partition function of the annealing 
problem involves an integral over coupling constants, and 
the integrand is the exponential of M times a free en- 
ergy that is of order unity as the number of copies M 
becomes large. Thus, as M — ► oo, the integral should be 
dominated by the saddle point where dT '/dg M = for 
all /i, or equivalently by values of the coupling constants 
such that 



(18) 



If we consider the limit f — -> 0, then this equation is 
exactly the same as Eq (UJ) which sets the values of the 
coupling constants in the maximum entropy approach. 
Thus we can write the saddle point approximation to 
Zm as 



Aexp 
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where {g^} are the coupling constants which provide the 
solution to Eq (j4j, and A is a constant that does not de- 
pend on M. Finally, we can extract the large M behavior 
of the free energy G = — In Zm, 
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To understand the result in Eq (f20|) . we should step 
back to our original formulation of the two methods. The 
free energy G in the annealing method refers to M copies 
of the system, so it is not surprising that it is propor- 
tional to M; once we take out this factor we almost get 
the free energy F from the maximum entropy method. 
The difference is that in the maximum entropy formu- 
lation the behavior of the system is a function of the 
coupling constants g^, while in the annealing method 
the free energy is explicitly a function of the expecta- 
tion values {Ofj). This is exactly the situation for the 



Hclmholtz and Gibbs free energies in thermodynamics — 
the Helmholtz free energy is a function of the volume, 
and the Gibbs free energy is a function of the pressure. 
More generally, whenever we have conjugate variables 
(pressure and volume, particle number and chemical po- 
tential, ... ), we use the Legendre transformation to con- 
nect descriptions based on one or the other member of 
the conjugate pair [l7T |. 

For the example of pressure and volume, we have 
G(T,p) = F + pV and hence the familiar differential 
relations 



dF(T, V) 

dV 
dG(T,p) 

dp 



= V. 



(21) 
(22) 



Crucially, F and G are descriptions of the same physical 
system. More strongly, for large systems (here, M — > oo) 
we know that constant pressure and constant volume en- 
sembles are equivalent. For our problem, the analogous 
equations are 



dF({g,}) 
1 dG({{6,)}) 



-0/1- 



(23) 
(24) 



The conclusion is that the annealing method, in the 
M — y oo, T — > limit, describes M independent copies 
of the maximum entropy model. 



IV. SHOULD WE BE SURPRISED? 

The derivation above is a bit circuitous, although it 
does make the connections between the two approaches 
explicit. We can make a somewhat shorter, if less con- 
structive, argument. 

The probability distribution used in the annealing cal- 
culations, Eq ([7]), is a Boltzmann distribution and hence 
a maximum entropy distribution. More precisely, it is 
the maximum entropy distribution consistent with some 
average value of x 2 between the observed and simulated 
expectation values; as usual (x 2 ) is set by the value of T. 
When we let T —> 0, the expectation value of % 2 must 
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approach its minimum value, which is zero, unless there 
is something very odd about the structure of the phase 
space. Thus in the T — > limit, the annealing method 
generates samples from a maximum entropy distribution 
in which the expectation values computed from the M 
simulated copies of the system are exactly equal to the 
observed values. Finally, if we let M — > oo, then what 
we are simulating is an ensemble of samples in which the 
selected expectation values exactly match their experi- 
mental values, but otherwise the distribution of samples 
has maximum entropy. That is, we have drawn samples 
out of the maximum entropy distribution consistent with 
the observed expectation values. 

The subtlety of this argument, which we hope justifies 
the longer calculation above, concerns the combination 
of the limits M — > oo and T — > and the role of the 
weights W,j, in defining \ 2 ■ The careful calculation shows 
that we actually need T = MT — > 0, which is stronger 
than one might have thought, and that in this limit the 
weights are irrelevant. 



V. FINITE SAMPLE SIZE 

Our discussion thus far has assumed that the expecta- 
tion values (Ofj,({(Ji})} are known. In fact we never know 
these expectation value exactly, since our inferences are 
based on a finite data set. For networks of neurons, if 
we define our variables o~\ as the presence or absence of a 
spike from cell i in a small window At = 10 — 20 ms, then 
an experiment of ~ 1 hr provides more than 10 5 samples 
of the state {<7i}, although of course not all these sam- 
ples are independent [7], l8| ■ With these relatively large 
sample sizes, it is plausible that we can approximate ex- 
pectation values, e.g. of the pairwise correlations among 
neurons Cy = (<Ti<7j) — (<7i)(<7j), by the corresponding 
time averages over the experiment, although of course 
with very large networks the number of pairs can become 
comparable to the number of samples and we should be 
careful. For proteins, in contrast, we have only a few 
families with ~ 10 3 known sequences, and in manycases 
one must work from fewer than 100 examples [H, HHHJ . 
Correspondingly the question of how to treat the issues 



of statistical significance in the estimation of of (O m ) 
has been much more at the center of the discussion of 
the protein data. 

Taken at face value, the construction of \ 2 in Eq 
^ gives our estimates of different operators different 
weights W^. One plausible choice for these weights (by 
analogy with usual construction of x 2 ) is to set equal 
to the inverse of the variance in our estimates of (O^). 
This might suggest that the distribution in Eq really 
does represent the probability of finding the empirical 
averages (O^)cmp, with T inversely proportional to the 
number of independent samples N samv in our original 
database. But from the maximum entropy distribution 
we can compute the expected variance in our estimates 
of the expectation values, and this is 
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Unfortunately, the same arguments can be used to show 
that errors in the estimates of the different operators in 
general are not independent, since 
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Thus, while the construction of \ 2 provides a convenient 
heuristic, it can't really represent the likelihood of mea- 
suring empirical expectation values given the true expec- 
tation values. Conveniently, in the limit M — > oo, T — > 0, 
all these concerns disappear and even the precise values 
of the are irrelevant. 

But if the construction of x 2 doesn't capture the sta- 
tistical significance of our estimated expectation values 
correctly, what should we do instead? Once we know 
that we are looking for the maximum entropy distribu- 
tion consistent with a certain set of expectation values, 
we know that the form of the distribution is given by Eq 
(QJ, and our task is to infer the parameters {g,}- If we 
imagine that we have N samp independent samples, with 



states {crji , Op 



}, then the probability of 



observing these data is given by 
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Now if we try to find the parameters {g,} by maximizing this probability (maximum likelihood estimation), we find 
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We see that this is the same as Eq (j4} if we identify the 
"known" expectation values with the averages over our 
finite set of samples. Thus, the maximum entropy con- 
struction can be viewed as maximum likelihood inference 
within a specified class of models, and in this framework 
many questions about the consequences of finite sample 
size can be seen as part of the more general problem of 
learning probabilistic models from data [H, [3| ■ 

In a Bayesian framework, we can construct a probabil- 
ity distribution for the parameters {g^} given the data 

{c^}- Exploring this distribution, e.g. by Monte Carlo 
in parameter space [201 ] . allows us to assign rigorous 
errors to our parameter estimates. More importantly, 
within a Bayesian framework we can integrate over pa- 
rameters to determine the likelihood that a given class of 
models generates the data [U ; this allows us to com- 
pare models in which all interactions are possible with 
those in which some interactions have been set exactly 
to zero. In the context of protein structure, if most inter- 
actions can be set to zero then we can envision the cru- 
cial amino acids as forming limited networks rather than 
being distributed throughout the protein [23|, In 
biochemical, genetic and neural networks, setting many 
interactions to zero would mean describing these net- 
works by a sparsely interconnected graph. It should be 
emphasized that the absence of statistically significant 
correlations between variables <j\ and Oj does not mean 
that there is no interaction between these variables. Al- 
though this program has not been carried out in any 
of the systems studied thus far, the Bayesian approach 
to model selection should provide a rigorous method for 
deciding on the number of significant interactions [25j |. 



VI. DISCUSSION 

As we collect more and more quantitative data on bi- 
ological systems, it becomes increasingly urgent to find 
a theoretical framework within which these data can be 
understood. In many cases, one approach to this prob- 
lem involves writing down a probability distribution that 
describes some network of interacting variables: 

• In the context of protein evolution, we would like 
to write down the probability that any particular 
amino acid sequence will arise as a functional pro- 
tein in a certain family. 

• In the context of neural networks, we would like to 
write down the probability that the network will 
exhibit any particular pattern of spiking and si- 
lence. 



• In the context of genetic networks, we would like 
to write down the probability that a cell will ex- 
hibit any particular combination of gene expression 
levels, either under a fixed set of conditions or av- 
erages over its lifetime. 

One might object that such probabilistic descriptions 
are not consistent with the search for a more 'rule based' 
understanding; surely, for example, some sequences form 
functional proteins and some do not. Without entering 
into a philosophical discussion about whether degrees of 
functionality can be mapped into probabilities, we note 
that as the systems we are studying become large, our 
intuition from statistical mechanics is that the difference 
between a description in which states are assigned prob- 
abilities (the canonical ensemble) and one in which some 
states are allowed and the rest are not (the microcanon- 
ical ensemble) becomes vanishingly small. Thus, even if 
we start with a probabilistic description, once we think 
about proteins with many amino acids or networks con- 
structed from many neurons or genes, we expect that 
our descriptions will converge to one in which there is a 
sharp distinction between allowed and disallowed com- 
binations of the underlying variables. 

The fact that we are interested in networks with many 
variables makes the task of constructing a probabilistic 
description quite daunting. With N elements, there are 
~ exp(aiV) possible combinations of the underlying vari- 
ables, where a typically is of order unity. Obviously no 
experiment will exhaustively explore this configuration 
space, just as no experiment exhaustively explores the 
configuration space of the spins in even a small magnetic 
grain. For the magnetic grain, however, there is a limited 
set of natural macroscopic variables to measure — such 
as the magnetization, specific heat, and susceptibility — 
that seem to provide a good characterization of the states 
available to the system. Can we hope for some analogous 
simplification in the context of biological networks? 

Quantities such as the magnetization and susceptibil- 
ity can be written as averages over the Boltzmann distri- 
bution. More precisely, the magnetization describes the 
average behavior of individual spins, and the susceptibil- 
ity describes the average behavior of pairs of spins (two- 
point correlations). The analogous idea, then, is that 
our description of biological systems might be simplified 
by focusing on correlations between pairs of elements, 
rather than allowing for arbitrarily complex combinato- 
rial interactions among many elements. This is precisely 
the strategy adopted in recent work on protein sequences 
[3, U, |23( and on networks of real neurons [7], |8| . In each 
case, although the focus on pairwise interactions was 
implemented differently, the approach was surprisingly 
successful, accounting for data far beyond the measured 
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correlations. What we have shown here is that the ap- 
proaches which arose in different contexts really are the 
same, so we have a single strategy for simplifying our 
description of biological systems that seems to be work- 
ing at very different levels of organization, from single 
molecules [!, H|, [23[ to biochemical and genetic networks 
[3 [H up to small chunks of the brain @, i, 03, E 03 . 
While much remains to be done to test the limits of this 
approach, this is a very exciting development. 

The annealing approach which was used in the analysis 
of protein sequences allows us to generate directly new 
samples from the simplified probabilistic model, without 
actually constructing the model explicitly. For proteins, 
these samples are new molecules that can be synthesized, 
and this has been the path to experimental test of the 
focus on pairwise correlations. One could imagine using 
the same method for neurons to generate new patterns 
of spiking and silence in the network, and one could then 
check that the higher-order correlations in these patterns 
(beyond the pairwise correlations which are matched ex- 
actly) agree with experiment. 

The explicit construction of the maximum entropy 
model, as has been done in the analysis of neurons, al- 
lows us to explore the "thermodynamics" of the system. 
Questions one can address include whether the space of 
configurations breaks into multiple basins, and whether 
the parameters of the biological system are in any sense 



special, e.g. because they are near a critical point. Per- 
haps the most direct question we can address given a 
maximum entropy model for the distribution of states 
concerns the entropy itself. In the context of neurons, 
this entropy sets the capacity for the system to convey 
information, whether about the external sensory inputs 
or about some internal variables such as memories and 
intentions. In the context of proteins, this entropy mea- 
sures the number of possible sequences that are consis- 
tent with membership in the particular family of func- 
tional proteins that we are studying. An explicit model 
for the distribution of sequences within a family is also 
the proper tool for assessing the likelihood that a pre- 
viously uncharacterized protein belongs to this family, a 
practical problem of central importance in analyzing the 
growing body of sequence data. 
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